Red Wine Quality Analysis by Anusha Poonepalle

Introduction

This exploratory data analysis project uses the red wine dataset, which contains 1599 observations of 13 variables. A histogram and a boxplot are constructed for each variable to show how its values are distributed. All the variables are numeric; quality is stored as an integer score but is treated here as a categorical variable that rates the wine. I would like to analyze the dataset to determine which variables affect the quality of wine and how the variables correlate with one another.

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
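
The output above is the kind of result that names(), str() and summary() return for a data frame. A minimal sketch follows, assuming a small data frame built from the first six values of four columns as printed by str() above. The name wine and the file name in the comment are assumptions for illustration, not taken from the original code.

```r
# First six values of four columns, copied from the str() output above.
# The full dataset would normally be read from a file, for example
# wine <- read.csv("wineQualityReds.csv")   # hypothetical file name
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L)
)

names(wine)    # column names
str(wine)      # storage type and first values of each column
summary(wine)  # minimum, quartiles, mean and maximum of each column
```

With only six rows the summary statistics differ from those reported for the full 1599 observations.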

Univariate Plots Section

Quality is considered the most important feature in this wine analysis. Its distribution is plotted as follows.

Most of the wines are of average quality, i.e. rated 5 or 6.

Univariate Analysis

Observations of variables:
1. Fixed.acidity has a median of about 7.9 and has outliers in the high range.
2. Volatile.acidity has a long tail that extends up to 1.58, with a median of 0.52.
3. The citric.acid plot looks different from the others: it has a median of 0.26, and almost no wines have a citric acid content beyond 0.80 (the maximum is 1.00).
4. Residual.sugar and chlorides look similar: both have a long tail on the right side as well as many outliers.
5. Free.sulfur.dioxide, total.sulfur.dioxide and sulphates have outliers in the high range and a long-tailed pattern.
6. Density and pH are approximately normally distributed. Most of the wines have a pH value in the range [3, 3.5].
7. Alcohol has fewer outliers than sulphates, residual.sugar, volatile.acidity, chlorides and free/total.sulfur.dioxide. It has a positively skewed distribution.
8. Quality ranges from 3 to 8 with a median of 6, and most of the values are 5, 6 or 7.
9. All the variables in the dataset have outliers.
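
The long tails and outliers noted above can also be checked numerically. This is a minimal sketch on made-up values (not taken from the dataset), assuming the usual boxplot rule of 1.5 times the interquartile range.

```r
# Illustrative right-skewed sample (made-up values, not from the dataset).
x <- c(1.8, 1.9, 2.0, 2.1, 2.2, 2.3, 2.5, 2.6, 6.1, 15.5)

# A long right tail pulls the mean above the median.
right_skewed <- mean(x) > median(x)
right_skewed

# boxplot.stats() flags points more than 1.5 * IQR beyond the hinges,
# the same rule boxplot() uses to draw outlier points.
outliers <- boxplot.stats(x)$out
outliers

# hist(x) and boxplot(x) would draw the corresponding plots.
```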

What is the structure of your dataset?

There are 1599 observations of 13 variables in the dataset, and the dataset is tidy. The X variable is only a row index and gives no information about the wine, so it is ignored. Of the remaining 12 variables, 11 are numeric chemical properties and one, quality, is treated as categorical. I would like to analyze how these variables affect the quality of wine.

What is/are the main feature(s) of interest in your dataset?

Quality is the main feature of interest in the dataset.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Alcohol, pH and residual.sugar may affect the quality of wine.

Bivariate Plots Section

##                          X fixed.acidity volatile.acidity citric.acid
## X                     1.00         -0.27            -0.01       -0.15
## fixed.acidity        -0.27          1.00            -0.26        0.67
## volatile.acidity     -0.01         -0.26             1.00       -0.55
## citric.acid          -0.15          0.67            -0.55        1.00
## residual.sugar       -0.03          0.11             0.00        0.14
## chlorides            -0.12          0.09             0.06        0.20
## free.sulfur.dioxide   0.09         -0.15            -0.01       -0.06
## total.sulfur.dioxide -0.12         -0.11             0.08        0.04
## density              -0.37          0.67             0.02        0.36
## pH                    0.14         -0.68             0.23       -0.54
## sulphates            -0.13          0.18            -0.26        0.31
## alcohol               0.25         -0.06            -0.20        0.11
## quality               0.07          0.12            -0.39        0.23
##                      residual.sugar chlorides free.sulfur.dioxide
## X                             -0.03     -0.12                0.09
## fixed.acidity                  0.11      0.09               -0.15
## volatile.acidity               0.00      0.06               -0.01
## citric.acid                    0.14      0.20               -0.06
## residual.sugar                 1.00      0.06                0.19
## chlorides                      0.06      1.00                0.01
## free.sulfur.dioxide            0.19      0.01                1.00
## total.sulfur.dioxide           0.20      0.05                0.67
## density                        0.36      0.20               -0.02
## pH                            -0.09     -0.27                0.07
## sulphates                      0.01      0.37                0.05
## alcohol                        0.04     -0.22               -0.07
## quality                        0.01     -0.13               -0.05
##                      total.sulfur.dioxide density    pH sulphates alcohol
## X                                   -0.12   -0.37  0.14     -0.13    0.25
## fixed.acidity                       -0.11    0.67 -0.68      0.18   -0.06
## volatile.acidity                     0.08    0.02  0.23     -0.26   -0.20
## citric.acid                          0.04    0.36 -0.54      0.31    0.11
## residual.sugar                       0.20    0.36 -0.09      0.01    0.04
## chlorides                            0.05    0.20 -0.27      0.37   -0.22
## free.sulfur.dioxide                  0.67   -0.02  0.07      0.05   -0.07
## total.sulfur.dioxide                 1.00    0.07 -0.07      0.04   -0.21
## density                              0.07    1.00 -0.34      0.15   -0.50
## pH                                  -0.07   -0.34  1.00     -0.20    0.21
## sulphates                            0.04    0.15 -0.20      1.00    0.09
## alcohol                             -0.21   -0.50  0.21      0.09    1.00
## quality                             -0.19   -0.17 -0.06      0.25    0.48
##                      quality
## X                       0.07
## fixed.acidity           0.12
## volatile.acidity       -0.39
## citric.acid             0.23
## residual.sugar          0.01
## chlorides              -0.13
## free.sulfur.dioxide    -0.05
## total.sulfur.dioxide   -0.19
## density                -0.17
## pH                     -0.06
## sulphates               0.25
## alcohol                 0.48
## quality                 1.00
## [1] "fixed.acidity"    "volatile.acidity" "citric.acid"     
## [4] "residual.sugar"
## [1] "chlorides"            "free.sulfur.dioxide"  "total.sulfur.dioxide"
## [4] "density"
## [1] "pH"        "sulphates" "alcohol"   "quality"
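
A correlation table like the one above can be produced with cor() and round(). The sketch below uses made-up toy values (not from the dataset), so the coefficients differ from those in the report; the name toy is an assumption for illustration.

```r
# Made-up toy data; the values are illustrative, not taken from the dataset.
toy <- data.frame(
  alcohol = c(9.4, 9.8, 10.5, 11.2, 12.0, 12.8),
  density = c(0.9978, 0.9970, 0.9968, 0.9962, 0.9955, 0.9949),
  quality = c(5, 5, 6, 6, 7, 8)
)

# Pairwise Pearson correlations, rounded to two decimals.
cor_matrix <- round(cor(toy), 2)
cor_matrix

# Correlation of every variable with quality, strongest positive first.
sort(cor_matrix[, "quality"], decreasing = TRUE)
```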

A new variable, ratingbucket, is created to divide quality into bins: (2, 4] as bad, (4, 6] as average and (6, 8] as good. This new variable will be helpful in the multivariate analysis.

## 
## (2,4] (4,6] (6,8] 
##    63  1319   217
##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4             0.70        0.00            1.9     0.076
## 2 2           7.8             0.88        0.00            2.6     0.098
## 3 3           7.8             0.76        0.04            2.3     0.092
## 4 4          11.2             0.28        0.56            1.9     0.075
## 5 5           7.4             0.70        0.00            1.9     0.076
## 6 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality ratingbucket
## 1       5        (4,6]
## 2       5        (4,6]
## 3       5        (4,6]
## 4       6        (4,6]
## 5       5        (4,6]
## 6       5        (4,6]
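
The interval labels in the ratingbucket column match what cut() produces with breaks at 2, 4, 6 and 8. A minimal sketch on a few made-up quality scores, assuming that is how the bins were built:

```r
# Made-up quality scores covering the observed range of 3 to 8.
quality <- c(3, 4, 5, 5, 6, 6, 7, 8)

# cut() uses right-closed intervals by default: (2,4], (4,6], (6,8].
ratingbucket <- cut(quality, breaks = c(2, 4, 6, 8))

# Number of wines in each bucket.
bucket_counts <- table(ratingbucket)
bucket_counts
```

Passing labels = c("bad", "average", "good") to cut() would replace the interval labels with the names used in the text.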

Based on the correlation results, the positively correlated plots are as follows.

As quality increases, the mean (blue point) and median of fixed.acidity fluctuate. The correlation is positive but weak (0.12), and the plot suggests that fixed.acidity has little effect on the quality of the wine.

The correlation between quality and residual.sugar is very low (0.013). Residual.sugar does not increase noticeably as quality increases.

Sulphates increase monotonically with the quality of wine.

Among all the properties, alcohol has the highest correlation with quality (0.48). Alcohol increases monotonically with quality.

The negatively correlated plots are as follows.

Volatile.acidity and chlorides decrease gradually as the quality of wine increases. Free.sulfur.dioxide, total.sulfur.dioxide, density and pH fluctuate at quality levels 5 and 6 but decrease overall as quality increases.
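
The per-quality mean and median discussed in this section can be computed with tapply(). This is a minimal sketch on made-up values (not from the dataset); the name toy is an assumption for illustration.

```r
# Made-up toy data; the values are illustrative, not taken from the dataset.
toy <- data.frame(
  quality = c(5, 5, 6, 6, 7, 7),
  alcohol = c(9.4, 9.8, 10.2, 10.8, 11.4, 12.0)
)

# Mean and median of alcohol at each quality level.
mean_by_quality   <- tapply(toy$alcohol, toy$quality, mean)
median_by_quality <- tapply(toy$alcohol, toy$quality, median)
mean_by_quality
median_by_quality

# TRUE when the group mean rises at every step up in quality.
increases_with_quality <- all(diff(mean_by_quality) > 0)
increases_with_quality
```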

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Based on the correlation results, fixed.acidity, citric.acid, residual.sugar, sulphates and alcohol are positively correlated with quality, while volatile.acidity, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density and pH are negatively correlated with it.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Citric.acid and density each correlate well with fixed.acidity (0.67). Free.sulfur.dioxide and total.sulfur.dioxide are also well correlated (0.67). Sulphates gradually increase with the quality of wine. In general, a pH below 7 is acidic. The surprising result was that pH decreases slightly as the quality of wine increases (correlation -0.06).

What was the strongest relationship you found?

Based on research, alcohol is one of the important components of red wine that can cause health issues. In this analysis, alcohol shows the strongest correlation with quality (0.48).

Multivariate Plots Section

Based on the bivariate analysis, alcohol has the highest positive correlation with quality among all the variables. The multivariate analysis checks how the other variables combine with alcohol and quality.

The quality of wine decreases with decreasing fixed.acidity as alcohol increases.

There is no significant increase in the quality of wine with an increase in alcohol content.

Residual.sugar has many outliers. At lower values of residual.sugar, the quality of wine decreases.

The quality of wine is good at higher levels of alcohol and sulphates.

Better-quality wines are more acidic as alcohol content increases.

Both plots show a good correlation, and quality improves as the x-axis and y-axis values increase.

At the higher ranges of total.sulfur.dioxide and free.sulfur.dioxide there are better wines, but there are also a few outliers.
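
One way to check statements such as "quality is good at higher levels of alcohol and sulphates" is to compare group means across the quality buckets. The sketch below uses made-up values (not from the dataset) and assumes buckets built with cut(); the names toy and bucket_means are assumptions for illustration.

```r
# Made-up toy data; the values are illustrative, not taken from the dataset.
toy <- data.frame(
  quality   = c(3, 4, 5, 6, 7, 8),
  alcohol   = c(9.0, 9.4, 9.8, 10.4, 11.5, 12.2),
  sulphates = c(0.50, 0.52, 0.58, 0.64, 0.74, 0.78)
)

# Bin quality into the three buckets (2,4], (4,6] and (6,8].
toy$ratingbucket <- cut(toy$quality, breaks = c(2, 4, 6, 8))

# Mean alcohol and mean sulphates within each bucket.
bucket_means <- aggregate(cbind(alcohol, sulphates) ~ ratingbucket,
                          data = toy, FUN = mean)
bucket_means
```

In this toy data both means rise from the lowest bucket to the highest, which is the pattern the text describes.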

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Wines with higher fixed.acidity and density tend to be of good quality, and quality increases monotonically with fixed.acidity and density. Yes, the amount of alcohol and the pH level improve the quality of wine. However, residual.sugar does not correlate well with quality.

Were there any interesting or surprising interactions between features?

Citric.acid and fixed.acidity, as well as density and fixed.acidity, show better-quality wines at higher ranges. The quality of wine is also good at higher ranges of alcohol and sulphate content.

Final Plots and Summary

Plot One

Description One

Among all the variables in the dataset, I chose quality as the main feature in order to analyze how the other chemical properties affect it.

Plot Two

Description Two

In general, a high intake of alcohol leads to health issues, so I assumed that there would be less alcohol in wine. The data show that the quality of wine increases as the alcohol content increases. I therefore chose this plot to examine the relationship between these two variables.

Plot Three

Description Three

The analysis of the red wine dataset shows that, for acidic (lower pH) wines, quality increases with alcohol content, which proved my assumption wrong.

Reflection

When I first looked at the red wine dataset, I was not sure which properties would affect the quality of wine. After exploring how each variable is distributed, I decided to treat quality as the main feature and determine which variables affect it. I thought alcohol, pH and residual.sugar might help to determine the quality of wine. Residual.sugar did not turn out to be useful in the analysis. The surprising result was that the more acidic wines have better quality. The alcohol content played an important role in exploring the quality of red wine.

For future data analysis, I would like to have a dataset that covers different wine styles, for example fruit composition, rich and dark wines, long-aging types and the techniques used in wine making. This information would help to explore the data and determine the quality of each wine style.

References:

https://www.hsph.harvard.edu › … › Drinks to Consume in Moderation

http://www.sthda.com › … › R software › Data Visualization › ggplot2 - Essentials

https://campus.datacamp.com/courses/introduction-to-r-for-finance/factors-4?ex=8

https://onlinecourses.science.psu.edu/stat857/node/223

http://www.sthda.com/english/wiki/ggplot2-colors-how-to-change-colors-automatically-and-manually